Abstract

We study how well one can recover sparse principal components of a data matrix using a sketch formed 
from a few of its elements. We show that for a wide class of optimization problems, if the sketch is 
close (in the spectral norm) to the original data matrix, then one can recover a near optimal solution 
to the optimization problem by using the sketch. In particular, we use this approach to obtain sparse 
principal components and show that for $m$ data points in $n$ dimensions, $O(\epsilon^{-2}\tilde{k}\max\{m,n\})$ elements
give an $\epsilon$-additive approximation to the sparse PCA problem ($\tilde{k}$ is the stable rank of the data matrix).
We demonstrate our algorithms extensively on image, text, biological and financial data. The results 
show that not only are we able to recover the sparse PCAs from the incomplete data, but by using our 
sparse sketch, the running time drops by a factor of five or more. 


1 Introduction 


Principal components analysis constructs a low dimensional subspace of the data such that projection
of the data onto this subspace preserves as much information as possible (or equivalently maximizes
the variance of the projected data). The earliest reference to principal components analysis (PCA) is in
Pearson [1901]. Since then, PCA has evolved into a classic tool for data analysis. A challenge for the
interpretation of the principal components (or factors) is that they can be linear combinations of all the
original variables. When the original variables have direct physical significance (e.g. genes in biological
applications or assets in financial applications) it is desirable to have factors which have loadings on only
a small number of the original variables. These interpretable factors are sparse principal components
(SPCA).

The question we address is not how to better perform sparse PCA; rather, it is whether one can perform
sparse PCA on incomplete data and be assured some degree of success. (Read: can one do sparse PCA
when one has a small sample of data points and those data points have missing features?) Incomplete
data is a situation that one is confronted with all too often in machine learning. For example, with user-
recommendation data, one does not have all the ratings of any given user. Or in a privacy preserving
setting, a client may not want to give you all entries in the data matrix. In such a setting, our goal is to
show that if the samples that you do get are chosen carefully, the sparse PCA features of the data can
be recovered within some provable error bounds. A significant part of this work is to demonstrate our
algorithms on a variety of data sets.

More formally, the data matrix is $A \in \mathbb{R}^{m \times n}$ ($m$ data points in $n$ dimensions). Data matrices often
have low effective rank. Let $A_k$ be the best rank-$k$ approximation to $A$; in practice, it is often possible to
choose a small value of $k$ for which $\|A - A_k\|_2$ is small. The best rank-$k$ approximation $A_k$ is obtained by
projecting $A$ onto the subspace spanned by its top-$k$ principal components $V_k$, which is the $n \times k$ matrix
containing the top-$k$ right singular vectors of $A$. These top-$k$ principal components are the solution to the
variance maximization problem:

$$V_k = \arg\max_{V \in \mathbb{R}^{n \times k},\ V^T V = I} \operatorname{trace}(V^T A^T A V).$$

We denote the maximum variance attainable by $\mathrm{OPT}_k$, which is the sum of squares of the top-$k$ singular
values of $A$. To get sparse principal components, you add a sparsity constraint to the optimization problem:
every column of $V$ should have at most $r$ non-zero entries (the sparsity parameter $r$ is an input),

$$S_k = \arg\max_{V \in \mathbb{R}^{n \times k},\ V^T V = I,\ \|V^{(i)}\|_0 \le r} \operatorname{trace}(V^T A^T A V). \qquad (1)$$
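As a small numerical illustration of the variance maximization problem, the following sketch computes the top-$k$ right singular vectors with NumPy and checks that the attained variance equals the sum of squares of the top-$k$ singular values. The helper names `top_k_components` and `explained_variance` and the random test matrix are illustrative assumptions, not part of the paper.

```python
import numpy as np

def top_k_components(A, k):
    """Return the n x k matrix whose columns are the top-k right singular vectors of A."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k].T

def explained_variance(V, A):
    """Objective trace(V^T A^T A V), computed as the squared Frobenius norm of A V."""
    return float(np.linalg.norm(A @ V, "fro") ** 2)

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 8))   # hypothetical data: m = 30 points, n = 8 dimensions
k = 2

V_k = top_k_components(A, k)
opt_k = explained_variance(V_k, A)
sigma = np.linalg.svd(A, compute_uv=False)

# OPT_k is the sum of squares of the top-k singular values.
assert np.isclose(opt_k, np.sum(sigma[:k] ** 2))

# A feasible 1-sparse choice (two coordinate axes) cannot exceed OPT_k.
E = np.eye(8)[:, :k]
assert explained_variance(E, A) <= opt_k + 1e-9
```

The second check reflects that adding the sparsity constraint only shrinks the feasible set, so any sparse solution captures at most $\mathrm{OPT}_k$.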


The sparse PCA problem is itself a very hard problem that is not only NP-hard, but also inapproximable
(Magdon-Ismail [2015]). There are many heuristics for obtaining sparse factors (Cadima and Jolliffe [1995],
Trendafilov et al. [2003], Zou et al. [2006], d'Aspremont et al. [2007, 2008], Moghaddam et al. [2006],
Shen and Huang [2008]), including some approximation algorithms with provable guarantees (Asteris et al.
[2014]). The existing research typically addresses the task of getting just the top principal component
($k = 1$). While the sparse PCA problem is hard and interesting, it is not the focus of this work.

We address the question: What if you do not know $A$, but only have a sparse sampling of some of the
entries in $A$ (incomplete data)? The sparse sampling is used to construct a sketch of $A$, denoted $\tilde{A}$. There
is not much else to do but solve the sparse PCA problem with the sketch $\tilde{A}$ instead of the full data $A$ to
get $\tilde{S}_k$,

$$\tilde{S}_k = \arg\max_{V \in \mathbb{R}^{n \times k},\ V^T V = I,\ \|V^{(i)}\|_0 \le r} \operatorname{trace}(V^T \tilde{A}^T \tilde{A} V). \qquad (2)$$


We study how $\tilde{S}_k$ performs as an approximation to $S_k$ with respect to the objective that we are trying
to optimize, namely $\operatorname{trace}(S^T A^T A S)$; the quality of approximation is measured with respect to the true
$A$. We show that the quality of approximation is controlled by how well $\tilde{A}^T\tilde{A}$ approximates $A^T A$ as
measured by the spectral norm of the deviation $A^T A - \tilde{A}^T\tilde{A}$. This is a general result that does not rely
on how one constructs the sketch $\tilde{A}$.


Theorem 1 (Sparse PCA from a Sketch) Let $S_k$ be a solution to the sparse PCA problem that solves
(1), and $\tilde{S}_k$ a solution to the sparse PCA problem for the sketch $\tilde{A}$ which solves (2). Then,

$$\operatorname{trace}(\tilde{S}_k^T A^T A \tilde{S}_k) \ge \operatorname{trace}(S_k^T A^T A S_k) - 2k\,\|A^T A - \tilde{A}^T\tilde{A}\|_2.$$

Theorem 1 says that if we can closely approximate $A$ with $\tilde{A}$, then we can compute, from $\tilde{A}$, sparse
components which capture almost as much variance as the optimal sparse components computed from the
full data $A$.
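Theorem 1 can be checked numerically on a tiny instance where the optimal sparse component is found by brute force. The sketch below handles only $k = 1$ (where the optimum over each support is the top eigenvector of a principal submatrix) and uses a thresholded copy of the data as a stand-in sketch; the function name `exact_sparse_pc`, the matrix sizes, and the threshold 0.5 are illustrative assumptions.

```python
import itertools
import numpy as np

def exact_sparse_pc(M, r):
    """Brute-force the top r-sparse unit vector v maximizing v^T M^T M v (k = 1 only)."""
    G = M.T @ M
    n = G.shape[0]
    best_val, best_v = -np.inf, None
    for support in itertools.combinations(range(n), r):
        idx = list(support)
        w, U = np.linalg.eigh(G[np.ix_(idx, idx)])   # eigenvalues in ascending order
        if w[-1] > best_val:
            best_val = w[-1]
            best_v = np.zeros(n)
            best_v[idx] = U[:, -1]
    return best_v

def variance_on_true_data(v, A):
    """Variance trace(v^T A^T A v), always measured against the true A."""
    return float(np.sum((A @ v) ** 2))

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 7))
A_sketch = np.where(np.abs(A) >= 0.5, A, 0.0)   # stand-in sketch of A
r, k = 3, 1

s_full = exact_sparse_pc(A, r)           # plays the role of S_k
s_sketch = exact_sparse_pc(A_sketch, r)  # plays the role of the sketch solution
gap = np.linalg.norm(A.T @ A - A_sketch.T @ A_sketch, 2)   # spectral norm

# Inequality of Theorem 1 for k = 1.
assert variance_on_true_data(s_sketch, A) >= variance_on_true_data(s_full, A) - 2 * k * gap
```

The enumeration over supports is exponential in general, which is consistent with the hardness of sparse PCA; it is feasible here only because $n = 7$.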

In our setting, the sketch $\tilde{A}$ is computed from a sparse sampling of the data elements in $A$ (incomplete
data). To determine which elements to sample, and how to form the sketch, we leverage some recent
results in elementwise matrix completion (Kundu et al. [2015]). In a nutshell, if one samples larger data
elements with higher probability than smaller data elements, then, for the resulting sketch $\tilde{A}$, the error
$\|A^T A - \tilde{A}^T\tilde{A}\|_2$ will be small. The details of the sampling scheme and how the error depends on the
number of samples are given in Section 2.1. Combining the bound on $\|A - \tilde{A}\|_2$ from Theorem 4 in
Section 2.1 with Theorem 1, we get our main result:


Theorem 2 (Sampling Complexity for Sparse PCA) Sample $s$ data-elements from $A \in \mathbb{R}^{m \times n}$ to form
the sparse sketch $\tilde{A}$ using Algorithm 1. Let $S_k$ be a solution to the sparse PCA problem that solves (1),
and let $\tilde{S}_k$, which solves (2), be a solution to the sparse PCA problem for the sketch $\tilde{A}$ formed from the $s$
sampled data elements. Suppose the number of samples $s$ satisfies

$$s \ge \frac{2k^2}{\epsilon^2}\left(\rho^2 + \frac{\epsilon\gamma}{3k}\right)\log\frac{m+n}{\delta}$$

($\rho^2$ and $\gamma$ are dimensionless quantities that depend only on $A$). Then, with probability at least $1 - \delta$,

$$\operatorname{trace}(\tilde{S}_k^T A^T A \tilde{S}_k) \ge \operatorname{trace}(S_k^T A^T A S_k) - 2\epsilon\,(2 + \epsilon/k)\,\|A\|_2^2.$$

The dependence of $\rho^2$ and $\gamma$ on $A$ is given in Section 2.1. Roughly speaking, we can ignore the term with
$\gamma$ since it is multiplied by $\epsilon/k$, and $\rho^2 = O(\tilde{k}\max\{m,n\})$, where $\tilde{k}$ is the stable (numerical) rank of $A$.
To paraphrase Theorem 2, when the stable rank is a small constant, with $O(k^2\max\{m,n\})$ samples, one
can recover almost as good sparse principal components as with all data (the price being a small fraction
of the optimal variance, since $\mathrm{OPT}_k \ge \|A\|_2^2$). As far as we know, this is the first result to show that it is
possible to provably recover sparse PCA from incomplete data. We also give an application of Theorem 1
to running sparse PCA after “denoising” the data using a greedy thresholding algorithm that sets the small
elements to zero (see Theorem 3). Such denoising is appropriate when the observed matrix has been
element-wise perturbed by small noise, and the uncontaminated data matrix is sparse and contains large
elements. We show that if an appropriate fraction of the (noisy) data is set to zero, one can still recover
sparse principal components. This gives a principled approach to regularizing sparse PCA in the presence
of small noise when the data is sparse.

Not only do our algorithms preserve the quality of the sparse principal components, but iterative 
algorithms for sparse PCA, whose running time is proportional to the number of non-zero entries in the 
input matrix, benefit from the sparsity of $\tilde{A}$. Our experiments show about five-fold speed gains while
producing near-comparable sparse components using less than 10 % of the data. 


Discussion. In summary, we show that one can recover sparse PCA from incomplete data while gaining 
computationally at the same time. Our result holds for the optimal sparse components from $A$ versus
from $\tilde{A}$. One cannot efficiently find these optimal components (since the problem is NP-hard to even
approximate), so one runs a heuristic, in which case the approximation error of the heuristic would have 
to be taken into account. Our experiments show that using the incomplete data with the heuristics is just 
as good as those same heuristics with the complete data. 

In practice, one may not be able to sample the data, but rather the samples are given to you. Our result 
establishes that if the samples are chosen with larger values being more likely, then one can recover sparse 
PCA. In practice one has no choice but to run the sparse PCA on these sampled elements and hope. Our 
theoretical results suggest that the outcome will be reasonable. This is because, while we do not have 
specific control over what samples we get, the samples are likely to represent the larger elements. For 
example, with user-product recommendation data, users are more likely to rate items they either really 
like (large positive value) or really dislike (large negative value). 


Notation. We use bold uppercase (e.g., $X$) for matrices and bold lowercase (e.g., $x$) for column vectors.
The $i$-th row of $X$ is $X_{(i)}$, and the $i$-th column of $X$ is $X^{(i)}$. Let $[n]$ denote the set $\{1, 2, \ldots, n\}$. $E(X)$ is
the expectation of a random variable $X$; for a matrix, $E(X)$ denotes the element-wise expectation. For a
matrix $X \in \mathbb{R}^{m \times n}$, the Frobenius norm $\|X\|_F$ is $\|X\|_F^2 = \sum_{i,j=1}^{m,n} X_{ij}^2$, and the spectral (operator) norm
$\|X\|_2$ is $\|X\|_2 = \max_{\|y\|_2 = 1} \|Xy\|_2$. We also have the $\ell_1$ and $\ell_0$ norms: $\|X\|_{\ell_1} = \sum_{i,j=1}^{m,n} |X_{ij}|$ and
$\|X\|_0$ (the number of non-zero entries in $X$). The $k$-th largest singular value of $X$ is $\sigma_k(X)$, and $\log x$ is
the natural logarithm of $x$.

2 Sparse PCA from a Sketch


In this section, we will prove Theorem 1 and give a simple application to zeroing small fluctuations as a
way to regularize to noise. In the next section we will use a more sophisticated way to select the elements
of the matrix allowing us to tolerate a sparser matrix (more incomplete data) but still recovering sparse
PCA to reasonable accuracy.

Theorem 1 will be a corollary of a more general result, for a class of optimization problems involving a
Lipschitz objective function over an arbitrary (not necessarily convex) domain. Let $f(V, X)$ be a function
that is defined for a matrix variable $V$ and a matrix parameter $X$. The optimization variable $V$ is in some
feasible set $\mathcal{S}$ which is arbitrary. The parameter $X$ is also arbitrary. We assume that $f$ is Lipschitz in $X$
with Lipschitz constant $\gamma$. So,

$$|f(V, X) - f(V, \tilde{X})| \le \gamma(X)\,\|X - \tilde{X}\|_2 \quad \forall V \in \mathcal{S}.$$

(Note we allow the Lipschitz constant to depend on $X$ but not $V$.) The next lemma is the key tool we
need to prove Theorem 1 and it may be of independent interest in other optimization settings. We are
interested in maximizing $f(V, X)$ w.r.t. $V$ to obtain $V^*$. But, we only have an approximation $\tilde{X}$ for $X$,
and so we maximize $f(V, \tilde{X})$ to obtain $\tilde{V}^*$, which will be a suboptimal solution with respect to $X$. We
wish to bound $f(V^*, X) - f(\tilde{V}^*, X)$, which quantifies how suboptimal $\tilde{V}^*$ is w.r.t. $X$.

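Lemma 1 (Surrogate optimization bound) Let $f(V, X)$ be $\gamma$-Lipschitz w.r.t. $X$ over the domain $V \in \mathcal{S}$.
Define

$$V^* = \arg\max_{V \in \mathcal{S}} f(V, X); \qquad \tilde{V}^* = \arg\max_{V \in \mathcal{S}} f(V, \tilde{X}).$$

Then,

$$f(V^*, X) - f(\tilde{V}^*, X) \le 2\gamma(X)\,\|X - \tilde{X}\|_2.$$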

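In the lemma, the function $f$ and the domain $\mathcal{S}$ are arbitrary. In our setting, $X \in \mathbb{R}^{n \times n}$, the domain
$\mathcal{S} = \{V \in \mathbb{R}^{n \times k};\ V^T V = I_k;\ \|V^{(i)}\|_0 \le r\}$, and $f(V, X) = \operatorname{trace}(V^T X V)$. We first show that $f$ is
Lipschitz w.r.t. $X$ with $\gamma = k$ (a constant independent of $X$). Let the representation of $V$ by its columns
be $V = [v_1, \ldots, v_k]$. Then,

$$|\operatorname{trace}(V^T X V) - \operatorname{trace}(V^T \tilde{X} V)| = |\operatorname{trace}((X - \tilde{X}) V V^T)| \le \sum_{i=1}^{k} \sigma_i(X - \tilde{X}) \le k\,\|X - \tilde{X}\|_2,$$

where $\sigma_i(A)$ is the $i$-th largest singular value of $A$ (we used von Neumann's trace inequality and the fact
that $V V^T$ is a $k$-dimensional projection). Now, by Lemma 1,

$$\operatorname{trace}(V^{*T} X V^*) - \operatorname{trace}(\tilde{V}^{*T} X \tilde{V}^*) \le 2k\,\|X - \tilde{X}\|_2.$$

Theorem 1 follows by setting $X = A^T A$ and $\tilde{X} = \tilde{A}^T\tilde{A}$.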
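The Lipschitz constant $k$ for the trace objective can be sanity-checked numerically. The following sketch draws random matrices with orthonormal columns and confirms that the change in the objective never exceeds $k$ times the spectral norm of the perturbation; the dimensions, perturbation scale, and number of trials are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 10, 3

def random_symmetric(size):
    """Random symmetric matrix, used here as a stand-in for A^T A and its sketch."""
    B = rng.standard_normal((size, size))
    return (B + B.T) / 2

X = random_symmetric(n)
X_tilde = X + 0.1 * random_symmetric(n)
spec = np.linalg.norm(X - X_tilde, 2)   # spectral norm of the perturbation

worst = 0.0
for _ in range(200):
    # QR of a Gaussian matrix gives V with orthonormal columns (V^T V = I_k).
    V, _ = np.linalg.qr(rng.standard_normal((n, k)))
    diff = abs(np.trace(V.T @ X @ V) - np.trace(V.T @ X_tilde @ V))
    worst = max(worst, diff)

# |trace(V^T X V) - trace(V^T X~ V)| <= k * ||X - X~||_2 for every feasible V.
assert worst <= k * spec + 1e-12
```

The check ignores the sparsity constraint on $V$, which is harmless here since the bound holds for every $V$ with orthonormal columns.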

Greedy thresholding. We give the simplest scenario of incomplete data where Theorem 1 gives some
reassurance that one can compute good sparse principal components. Suppose the smallest data elements
have been set to zero. This can happen, for example, if only the largest elements are measured, or in a
noisy setting if the small elements are treated as noise and set to zero. So

$$\tilde{A}_{ij} = \begin{cases} A_{ij} & |A_{ij}| \ge \delta; \\ 0 & |A_{ij}| < \delta. \end{cases}$$

Recall $\tilde{k} = \|A\|_F^2/\|A\|_2^2$ (stable rank of $A$), and define $\|A_\delta\|_F^2 = \sum_{|A_{ij}| < \delta} A_{ij}^2$. Let $A = \tilde{A} + \Delta$. By
construction, $\|\Delta\|_F^2 = \|A_\delta\|_F^2$. Then,

$$\|A^T A - \tilde{A}^T\tilde{A}\|_2 = \|A^T\Delta + \Delta^T A - \Delta^T\Delta\|_2 \le 2\|A\|_2\|\Delta\|_2 + \|\Delta\|_2^2. \qquad (3)$$

Suppose the zeroing of elements only loses a fraction of the energy in $A$, i.e. $\delta$ is selected so that $\|A_\delta\|_F^2 \le
\epsilon^2\|A\|_F^2/\tilde{k}$; that is, an $\epsilon^2/\tilde{k}$ fraction of the total variance in $A$ has been lost in the unmeasured (or zero)
data. Then

$$\|\Delta\|_2 \le \|\Delta\|_F \le \frac{\epsilon}{\sqrt{\tilde{k}}}\|A\|_F = \epsilon\|A\|_2.$$

Theorem 3 Suppose that $\tilde{A}$ is created from $A$ by zeroing all elements whose magnitude is less than $\delta$, and $\delta$ is such
that the truncated norm satisfies $\|A_\delta\|_F^2 \le \epsilon^2\|A\|_F^2/\tilde{k}$. Then the sparse PCA solution $\tilde{V}^*$ satisfies

$$\operatorname{trace}(\tilde{V}^{*T} A^T A \tilde{V}^*) \ge \operatorname{trace}(V^{*T} A^T A V^*) - 2k\epsilon\,\|A\|_2^2\,(2 + \epsilon).$$

Theorem 3 shows that it is possible to recover sparse PCA after setting small elements to zero. This is
appropriate when most of the elements in $A$ are small noise and a few of the elements in $A$ contain large
data elements. For example, if your data consists of sparse $O(\sqrt{nm})$ large elements (of magnitude, say, 1)
and many $nm - O(\sqrt{nm})$ small elements whose magnitude is $o(1/\sqrt{nm})$ (high signal-to-noise setting),
then $\|A_\delta\|_2/\|A\|_2 \to 0$ and with just a sparse sampling of the $O(\sqrt{nm})$ large elements (very incomplete
data), one recovers near optimal sparse PCA.
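The greedy thresholding sketch and the bound (3) are easy to reproduce on synthetic data. The following sketch builds a matrix with a few large entries and many small noise entries, zeroes the small entries, and checks the spectral bound; the helper names `stable_rank` and `greedy_threshold`, the matrix sizes, and the threshold $\delta = 0.5$ are illustrative assumptions.

```python
import numpy as np

def stable_rank(A):
    """Stable rank: squared Frobenius norm divided by squared spectral norm."""
    return np.linalg.norm(A, "fro") ** 2 / np.linalg.norm(A, 2) ** 2

def greedy_threshold(A, delta):
    """Keep entries with |A_ij| >= delta and set all others to zero."""
    return np.where(np.abs(A) >= delta, A, 0.0)

rng = np.random.default_rng(3)
m, n = 40, 30
A = 0.01 * rng.standard_normal((m, n))            # many small "noise" entries
big = rng.choice(m * n, size=35, replace=False)   # roughly sqrt(m*n) large entries
A.flat[big] = rng.choice([-1.0, 1.0], size=35)

A_tilde = greedy_threshold(A, delta=0.5)
Delta = A - A_tilde

lhs = np.linalg.norm(A.T @ A - A_tilde.T @ A_tilde, 2)
d2 = np.linalg.norm(Delta, 2)
rhs = 2 * np.linalg.norm(A, 2) * d2 + d2 ** 2
assert lhs <= rhs + 1e-12                               # inequality (3)
assert d2 <= np.linalg.norm(Delta, "fro") + 1e-12       # spectral <= Frobenius

# Smallest epsilon for which the hypothesis of Theorem 3 holds on this instance.
eps = np.linalg.norm(Delta, "fro") * np.sqrt(stable_rank(A)) / np.linalg.norm(A, "fro")
k = 1
loss_bound = 2 * k * eps * np.linalg.norm(A, 2) ** 2 * (2 + eps)
```

Here `loss_bound` is the additive loss that Theorem 3 guarantees for this instance; it is small because the discarded entries carry little energy.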

Greedily keeping only the large elements of the matrix requires a particular structure in $A$ to work,
and it is based on a crude Frobenius-norm bound for the spectral error. In Section 2.1, we use recent
results in element-wise matrix sparsification to choose the elements in a randomized way, with a bias
toward large elements. With high probability, one can directly bound the spectral error and hence get
better performance. But first, let us prove Lemma 1.

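A Proof of Lemma 1. We need the following lemma.

Lemma 2 Let $f$ and $g$ be functions on a domain $\mathcal{S}$. Then,

$$\sup_{x \in \mathcal{S}} f(x) - \sup_{y \in \mathcal{S}} g(y) \le \sup_{x \in \mathcal{S}} (f(x) - g(x)).$$

Proof:

$$\sup_{x \in \mathcal{S}} (f(x) - g(x)) \ge f(x) - g(x) \ge f(x) - \sup_{y \in \mathcal{S}} g(y), \quad \forall x \in \mathcal{S}.$$

Since the RHS holds for all $x$, it follows that $\sup_{x \in \mathcal{S}} (f(x) - g(x))$ is an upper bound for $f(x) -
\sup_{y \in \mathcal{S}} g(y)$, and hence

$$\sup_{x \in \mathcal{S}} (f(x) - g(x)) \ge \sup_{x \in \mathcal{S}} \Big( f(x) - \sup_{y \in \mathcal{S}} g(y) \Big) = \sup_{x \in \mathcal{S}} f(x) - \sup_{y \in \mathcal{S}} g(y).$$

$\square$

Algorithm 1 Hybrid $(\ell_1, \ell_2)$-Element Sampling
Input: $A \in \mathbb{R}^{m \times n}$; number of samples $s$; probabilities $\{p_{ij}\}$.
1: Set $\tilde{A} = 0_{m \times n}$.
2: for $t = 1 \ldots s$ (i.i.d. trials with replacement) do
3:   Randomly sample indices $(i_t, j_t) \in [m] \times [n]$ with $P[(i_t, j_t) = (i, j)] = p_{ij}$.
4:   Update $\tilde{A}$: $\tilde{A}_{ij} \leftarrow \tilde{A}_{ij} + \dfrac{A_{ij}}{s \cdot p_{ij}}$.
5: return $\tilde{A}$ (with at most $s$ non-zero entries).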


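Proof: (Lemma 1) Suppose that $\max_{V \in \mathcal{S}} f(V, X)$ is attained at $V^*$ and $\max_{V \in \mathcal{S}} f(V, \tilde{X})$ is attained at
$\tilde{V}^*$, and define $\epsilon = f(V^*, X) - f(\tilde{V}^*, X)$. We have that

$$\epsilon = f(V^*, X) - f(\tilde{V}^*, \tilde{X}) + f(\tilde{V}^*, \tilde{X}) - f(\tilde{V}^*, X)$$
$$= \max_{V} f(V, X) - \max_{U} f(U, \tilde{X}) + f(\tilde{V}^*, \tilde{X}) - f(\tilde{V}^*, X)$$
$$\le \max_{V} \big( f(V, X) - f(V, \tilde{X}) \big) + f(\tilde{V}^*, \tilde{X}) - f(\tilde{V}^*, X),$$

where the last step follows from Lemma 2. Therefore,

$$\epsilon \le \max_{V} \big| f(V, X) - f(V, \tilde{X}) \big| + \big| f(\tilde{V}^*, \tilde{X}) - f(\tilde{V}^*, X) \big|$$
$$\le \gamma(X)\|X - \tilde{X}\|_2 + \gamma(X)\|X - \tilde{X}\|_2$$
$$= 2\gamma(X)\|X - \tilde{X}\|_2.$$

(We used the Lipschitz condition in the second step.) $\square$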


2.1 An $(\ell_1, \ell_2)$-Sampling Based Sketch

In the previous section, we created the sketch by deterministically setting the small data elements to zero.
Instead, we could randomly select the data elements to keep. It is natural to bias this random sampling
toward the larger elements. Therefore, we define sampling probabilities for each data element $A_{ij}$ which
are proportional to a mixture of the absolute value and square of the data element:

$$p_{ij} = \alpha\,\frac{|A_{ij}|}{\|A\|_{\ell_1}} + (1 - \alpha)\,\frac{A_{ij}^2}{\|A\|_F^2}, \qquad (4)$$


where $\alpha \in (0, 1]$ is a mixing parameter. Such a sampling probability was used in Kundu et al. [2015] to
sample data elements in independent trials to get a sketch $\tilde{A}$. We repeat the prototypical algorithm for
element-wise matrix sampling in Algorithm 1.

Note that unlike with the deterministic zeroing of small elements, in this sampling scheme, one samples
the element $A_{ij}$ with probability $p_{ij}$ and then rescales it by $1/p_{ij}$. To see the intuition for this rescaling,
consider the expected outcome for a single sample:

$$E[\tilde{A}_{ij}] = p_{ij} \cdot (A_{ij}/p_{ij}) + (1 - p_{ij}) \cdot 0 = A_{ij};$$

that is, $\tilde{A}$ is a sparse but unbiased estimate for $A$. This unbiasedness holds for any choice of the sampling
probabilities $p_{ij}$ defined over the elements of $A$ in Algorithm 1. However, for an appropriate choice of
the sampling probabilities, we get much more than unbiasedness; we can control the spectral norm of the
deviation, $\|A - \tilde{A}\|_2$. In particular, the hybrid-$(\ell_1, \ell_2)$ distribution in (4) was analyzed in Kundu et al.
[2015], where they suggest an optimal choice for the mixing parameter $\alpha^*$ which minimizes the theoretical
bound on $\|A - \tilde{A}\|_2$. This algorithm to choose $\alpha^*$ is summarized in Algorithm 2.
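The hybrid probabilities in (4) and the element sampling of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `hybrid_probabilities` and `element_sample`, the fixed $\alpha = 0.5$, and the test matrix are assumptions made for the example. Averaging many independent sketches illustrates the unbiasedness discussed above.

```python
import numpy as np

def hybrid_probabilities(A, alpha):
    """Equation (4): mixture of |A_ij|/||A||_l1 and A_ij^2/||A||_F^2."""
    abs_a = np.abs(A)
    return alpha * abs_a / abs_a.sum() + (1.0 - alpha) * A ** 2 / np.sum(A ** 2)

def element_sample(A, s, p, rng):
    """Algorithm 1: s i.i.d. draws with replacement, each adding A_ij / (s * p_ij)."""
    m, n = A.shape
    draws = rng.choice(m * n, size=s, replace=True, p=p.ravel())
    counts = np.bincount(draws, minlength=m * n).reshape(m, n)
    sketch = np.zeros((m, n))
    hit = counts > 0                      # only sampled positions are non-zero
    sketch[hit] = counts[hit] * A[hit] / (s * p[hit])
    return sketch

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 15))
p = hybrid_probabilities(A, alpha=0.5)

one_sketch = element_sample(A, 2000, p, rng)

# Unbiasedness: the average of many independent sketches approaches A.
mean_sketch = np.mean([element_sample(A, 2000, p, rng) for _ in range(200)], axis=0)
rel_err = np.linalg.norm(mean_sketch - A, "fro") / np.linalg.norm(A, "fro")
```

Counting repeated draws with `np.bincount` is equivalent to applying the update in step 4 once per trial, since each hit on position $(i, j)$ adds the same amount.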

Using the probabilities in (4) to create the sketch $\tilde{A}$ using Algorithm 1 with $\alpha^*$ selected using
Algorithm 2, one can prove a bound for $\|A - \tilde{A}\|_2$. We state a simplified version of the bound from Kundu et al.
[2015] in Theorem 4.


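Theorem 4 (Kundu et al. [2015]) Let $A \in \mathbb{R}^{m \times n}$ and let $\epsilon > 0$ be an accuracy parameter. Define
probabilities $p_{ij}$ as in (4) with $\alpha^*$ chosen using Algorithm 2. Let $\tilde{A}$ be the sparse sketch produced using
Algorithm 1 with a number of samples

$$s \ge \frac{2}{\epsilon^2}\left(\rho^2 + \gamma\epsilon/3\right)\log\frac{m+n}{\delta},$$

where

$$\rho^2 = \frac{\tilde{k} \cdot \max\{m, n\}}{\alpha \cdot \tilde{k} \cdot \frac{\|A\|_2}{\|A\|_{\ell_1}} + (1 - \alpha)}, \quad \text{and} \quad \gamma \le 1 + \frac{\sqrt{mn\tilde{k}}}{\alpha}.$$

Then, with probability at least $1 - \delta$,

$$\|A - \tilde{A}\|_2 \le \epsilon\,\|A\|_2.$$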


Proof: Follows from the bound in Kundu et al. [2015]. $\square$

Recall that $\tilde{k}$ is the stable rank of $A$. In practice, $\alpha^*$ is bounded away from 0 and 1, and so $s =
O(\epsilon^{-2}\tilde{k}\max\{m, n\})$ samples suffice to get a sketch $\tilde{A}$ for which $\|A - \tilde{A}\|_2 \le \epsilon\|A\|_2$. This is exactly
what we need to prove Theorem 2.
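To get a feel for the sample-size requirement, the following sketch evaluates the bound of Theorem 4 for a given matrix. It assumes the reading of the theorem in which $\rho^2 = \tilde{k}\max\{m,n\}/(\alpha\tilde{k}\|A\|_2/\|A\|_{\ell_1} + (1-\alpha))$ and $\gamma \le 1 + \sqrt{mn\tilde{k}}/\alpha$; the function names `theorem4_quantities` and `samples_needed` are hypothetical.

```python
import math
import numpy as np

def theorem4_quantities(A, alpha):
    """Dimensionless rho^2 and the upper bound on gamma, as read from Theorem 4 (assumed form)."""
    m, n = A.shape
    spec = np.linalg.norm(A, 2)
    k_stable = np.linalg.norm(A, "fro") ** 2 / spec ** 2
    l1 = np.abs(A).sum()
    rho2 = k_stable * max(m, n) / (alpha * k_stable * spec / l1 + (1.0 - alpha))
    gamma = 1.0 + math.sqrt(m * n * k_stable) / alpha
    return rho2, gamma

def samples_needed(rho2, gamma, eps, delta, m, n):
    """Smallest integer s with s >= (2/eps^2) (rho^2 + gamma*eps/3) log((m+n)/delta)."""
    return math.ceil(2.0 / eps ** 2 * (rho2 + gamma * eps / 3.0) * math.log((m + n) / delta))

rng = np.random.default_rng(6)
A = rng.standard_normal((200, 100))
rho2, gamma = theorem4_quantities(A, alpha=0.5)

k = 1
s_thm4 = samples_needed(rho2, gamma, 0.5, 0.1, *A.shape)      # tolerance eps
s_thm2 = samples_needed(rho2, gamma, 0.5 / k, 0.1, *A.shape)  # tolerance eps/k, as in Theorem 2
```

For small dense matrices the bound can exceed $mn$, in which case sampling brings no saving; the bound is informative mainly for large matrices with small stable rank.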


Proof of Theorem 2. The number of samples $s$ in Theorem 2 corresponds to the number of samples
needed in Theorem 4 with the error tolerance $\epsilon/k$. Using (3) (where $\Delta = A - \tilde{A}$) and Theorem 4, we
have that

$$\|A^T A - \tilde{A}^T\tilde{A}\|_2 \le \frac{2\epsilon}{k}\|A\|_2^2 + \frac{\epsilon^2}{k^2}\|A\|_2^2. \qquad (5)$$

Using (5) in Theorem 1 gives Theorem 2.


3 Experiments 

We show the experimental performance of sparse PCA from a sketch using several real data matrices. As
we mentioned, sparse PCA is NP-hard, and so we must use heuristics. These heuristics are discussed
next, followed by the data, the experimental design and finally the results.


3.1 Algorithms for Sparse PCA 

Let $\mathcal{G}$ (ground truth) denote the algorithm which computes the principal components (which may not be
sparse) of the full data matrix $A$; the optimal variance is $\mathrm{OPT}_k$. We consider six heuristics for getting
sparse principal components.

$\mathcal{G}_{\max,r}$   The $r$ largest-magnitude entries in each principal component generated by $\mathcal{G}$.
$\mathcal{G}_{sp,r}$    $r$-sparse components using the Spasm toolbox of Sjöstrand et al. [2012] with $A$.
$\mathcal{H}_{\max,r}$   The $r$ largest entries of the principal components for the $(\ell_1, \ell_2)$-sampled sketch $\tilde{A}$.
$\mathcal{H}_{sp,r}$    $r$-sparse components using Spasm with the $(\ell_1, \ell_2)$-sampled sketch $\tilde{A}$.
$\mathcal{U}_{\max,r}$   The $r$ largest entries of the principal components for the uniformly sampled sketch $\tilde{A}$.
$\mathcal{U}_{sp,r}$    $r$-sparse components using Spasm with the uniformly sampled sketch $\tilde{A}$.

Algorithm 2 Optimal Mixing Parameter $\alpha^*$
Input: $A \in \mathbb{R}^{m \times n}$.
1: Define two functions of $\alpha$ that depend on $A$:

$$\rho^2(\alpha) = \max\left\{ \max_i \sum_{j=1}^{n} \xi_{ij},\ \max_j \sum_{i=1}^{m} \xi_{ij} \right\},$$

$$\gamma(\alpha) = \max_{i,j:\ A_{ij} \ne 0} \left\{ \frac{\|A\|_{\ell_1}}{\alpha + (1 - \alpha)\,\frac{\|A\|_{\ell_1} \cdot |A_{ij}|}{\|A\|_F^2}} \right\} + \|A\|_2,$$

where

$$\xi_{ij} = \|A\|_F^2 \Big/ \left( \frac{\alpha \cdot \|A\|_F^2}{|A_{ij}| \cdot \|A\|_{\ell_1}} + (1 - \alpha) \right), \quad \text{for } A_{ij} \ne 0.$$

2: Find $\alpha^* \in (0, 1]$ to minimize $\rho^2(\alpha) + \gamma(\alpha)\,\epsilon\,\|A\|_2/3$.
3: return $\alpha^*$
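Algorithm 2 can be approximated by a simple grid search over $\alpha$. The sketch below assumes that $\xi_{ij} = A_{ij}^2/p_{ij}$ and that $\gamma(\alpha) = \max_{ij} |A_{ij}|/p_{ij} + \|A\|_2$, with $p_{ij}$ the hybrid probabilities; the exact form of $\gamma(\alpha)$ is partly garbled in the source, so that part in particular should be treated as an assumption. The function names and the grid over $[0.1, 1]$ are illustrative.

```python
import numpy as np

def rho2_gamma(A, alpha):
    """Evaluate rho^2(alpha) and gamma(alpha) of Algorithm 2 (assumed forms, see lead-in)."""
    abs_a = np.abs(A)
    fro2 = np.sum(A ** 2)
    l1 = abs_a.sum()
    nz = abs_a > 0
    xi = np.zeros_like(abs_a)
    xi[nz] = fro2 / (alpha * fro2 / (abs_a[nz] * l1) + (1.0 - alpha))
    rho2 = max(xi.sum(axis=1).max(), xi.sum(axis=0).max())   # max row sum vs. max column sum
    gamma = np.max(l1 / (alpha + (1.0 - alpha) * l1 * abs_a[nz] / fro2)) + np.linalg.norm(A, 2)
    return rho2, gamma

def optimal_alpha(A, eps, grid=np.linspace(0.1, 1.0, 91)):
    """Grid search for alpha minimizing rho^2(alpha) + gamma(alpha) * eps * ||A||_2 / 3."""
    spec = np.linalg.norm(A, 2)

    def objective(a):
        rho2, gamma = rho2_gamma(A, a)
        return rho2 + gamma * eps * spec / 3.0

    return float(min(grid, key=objective))

rng = np.random.default_rng(7)
A = rng.standard_normal((30, 20))
alpha_star = optimal_alpha(A, eps=0.1)
```

A one-dimensional grid is a deliberately simple choice; any scalar minimizer over $(0, 1]$ would serve the same purpose.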


The outputs of an algorithm $\mathcal{Z}$ are sparse principal components $V$, and the metric we are interested in
is the variance, $f(\mathcal{Z}) = \operatorname{trace}(V^T A^T A V)$, where $A$ is the original centered data. We consider the
following statistics.

$f(\mathcal{G}_{\max,r})/f(\mathcal{G}_{sp,r})$   Relative loss of greedy thresholding versus Spasm, illustrating the value of a good
sparse PCA algorithm. Our sketch based algorithms do not address this loss.

$f(\mathcal{H}_{\max/sp,r})/f(\mathcal{G}_{\max/sp,r})$   Relative loss of using the $(\ell_1, \ell_2)$-sketch $\tilde{A}$ instead of complete data $A$. A ratio close
to 1 is desired.

$f(\mathcal{U}_{\max/sp,r})/f(\mathcal{G}_{\max/sp,r})$   Relative loss of using the uniform sketch $\tilde{A}$ instead of complete data $A$. A benchmark
to highlight the value of a good sketch.

We also report on the computation time for the algorithms. We show results to confirm that sparse PCA
algorithms using the $(\ell_1, \ell_2)$-sketch are nearly comparable to those same algorithms on the complete data;
and, computing from a sparse sketch has a running time that is reduced proportionately to the sparsity.
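The thresholding heuristic and the variance ratio can be sketched as follows for the top component ($k = 1$). Renormalizing the truncated vector to unit length is an assumption made so that the result is feasible for (1); the function names, the synthetic data, and the thresholded stand-in sketch are illustrative and do not reproduce the experiments reported here.

```python
import numpy as np

def top_component(M):
    """Top right singular vector of M (the first principal component direction)."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[0]

def max_r(v, r):
    """Keep the r largest-magnitude entries of v, zero the rest, renormalize (assumed)."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-r:]
    out[keep] = v[keep]
    return out / np.linalg.norm(out)

def variance(v, A):
    """Metric f = trace(v^T A^T A v), always evaluated on the complete data A."""
    return float(np.sum((A @ v) ** 2))

rng = np.random.default_rng(5)
A = rng.standard_normal((50, 20)) * (rng.random((50, 20)) < 0.3)
A = A - A.mean(axis=0)                          # center the data
A_sketch = np.where(np.abs(A) >= 0.3, A, 0.0)   # stand-in for a sampled sketch

r = 5
g = max_r(top_component(A), r)          # analogue of the complete-data heuristic
h = max_r(top_component(A_sketch), r)   # same heuristic applied to the sketch
ratio = variance(h, A) / variance(g, A)
```

Because thresholding is a heuristic rather than an exact solver, the ratio may occasionally exceed 1, as it does in some table entries below.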


3.2 Data Sets 

We show results on image, text, stock, and gene expression data. We briefly describe the datasets below. 


Digit Data ($m = 2313$, $n = 256$): We use the Hull [1994] handwritten zip-code digit images (300
pixels/inch in 8-bit gray scale). Each pixel is a feature (normalized to be in $[-1, 1]$). Each $16 \times 16$ digit
image forms a row of the data matrix $A$. We focus on three digits: “6” (664 samples), “9” (644 samples),
and “1” (1005 samples).

TechTC Data ($m = 139$, $n = 15170$): We use the Technion Repository of Text Categorization
Dataset (TechTC, see Gabrilovich and Markovitch [2004]) from the Open Directory Project (ODP). Each
document is represented as a probability distribution over a bag-of-words, with words being the features;
we removed words with fewer than 5 letters. Each of the 139 documents forms a row in the data.

Stock Data ($m = 7056$, $n = 1218$): We use S&P100 stock market data of prices for 1218 stocks
collected between 1983 and 2011. This temporal dataset has 7056 snapshots of stock prices. The prices
of each day form a row of the data matrix and a principal component represents an “index” of sorts; each
stock is a feature.

Gene Expression Data ($m = 107$, $n = 22215$): We use GSE10072 gene expression data for lung
cancer from the NCBI Gene Expression Omnibus database. There are 107 samples (58 lung tumor cases
and 49 normal lung controls) forming the rows of the data matrix, with 22,215 probes (features) from the
GPL96 platform annotation table.

3.3 Results 

We report results primarily for the top principal component ($k = 1$), which is the case most considered in
the literature. When $k > 1$, our results do not qualitatively change.

Handwritten Digits. Using Algorithm 2, the optimal mixing parameter is $\alpha^* = 0.42$. We sample
approximately 7% of the elements from the centered data using $(\ell_1, \ell_2)$-sampling, as well as uniform
sampling. The performance for small $r$ is shown in Table 1, including the running time $\tau$.


$r$     $f(\mathcal{H}_{\max/sp,r})/f(\mathcal{G}_{\max/sp,r})$   $\tau(\mathcal{G})/\tau(\mathcal{H})$   $f(\mathcal{U}_{\max/sp,r})/f(\mathcal{G}_{\max/sp,r})$   $\tau(\mathcal{G})/\tau(\mathcal{U})$
20      1.01/0.89      6.03      1.13/0.56      4.7
40      0.99/0.90      6.21      1.01/0.70      5.33
60      0.99/0.98      5.96      0.97/0.80      5.33
80      0.99/0.95      6.03      0.94/0.81      5.18
100     0.99/0.98      6.22      0.95/0.87      5.08

Table 1: [Digits] Comparison of sparse principal components from the $(\ell_1, \ell_2)$-sketch and uniform sketch.


For this data, $f(\mathcal{G}_{\max,r})/f(\mathcal{G}_{sp,r}) \approx 0.23$ ($r = 10$), so it is important to use a good sparse PCA
algorithm. We see from Table 1 that the $(\ell_1, \ell_2)$-sketch significantly outperforms the uniform sketch. A
more extensive comparison of recovered variance is given in Figure 2(a). We also observe a speed-up of
a factor of about 6 for the $(\ell_1, \ell_2)$-sketch. We point out that the uniform sketch is reasonable for the digits
data because most data elements are close to either $+1$ or $-1$, since the pixels are either black or white.

We show a visualization of the principal components in Figure 1. We observe that the sparse components
from the $(\ell_1, \ell_2)$-sketch are almost identical to the sparse components from the complete data.

TechTC Data. Algorithm 2 gives optimal mixing parameter $\alpha^* = 1$. We sample approximately 5%
of the elements from the centered data using our $(\ell_1, \ell_2)$-sampling, as well as uniform sampling. The
performance for small $r$ is shown in Table 2, including the running time $\tau$.

For this data, $f(\mathcal{G}_{\max,r})/f(\mathcal{G}_{sp,r}) \approx 0.84$ ($r = 10$). We observe a very significant performance
difference between the $(\ell_1, \ell_2)$-sketch and uniform sketch. A more extensive comparison of recovered variance
is given in Figure 2(b). We also observe a speed-up of a factor of about 6 for the $(\ell_1, \ell_2)$-sketch. Unlike
the digits data which is uniformly near $\pm 1$, the text data is “spikey” and now it is important to sample with
a bias toward larger elements, which is why the uniform-sketch performs very poorly.

[Figure 1 panels: (a) $r = 100\%$; (b) $r = 50\%$; (c) $r = 30\%$; (d) $r = 10\%$]

Figure 1: [Digits] Visualization of top-3 sparse principal components. In each figure, left panel shows
$\mathcal{G}_{sp,r}$ and right panel shows $\mathcal{H}_{sp,r}$.



[Figure 2 panels: (a) Digit; (b) TechTC; (c) Stock; (d) Gene. Each panel plots $f(\mathcal{H}_{sp,r})/f(\mathcal{G}_{sp,r})$ and $f(\mathcal{U}_{sp,r})/f(\mathcal{G}_{sp,r})$ against the sparsity constraint $r$ (percent).]

Figure 2: Performance of sparse PCA for $(\ell_1, \ell_2)$-sketch and uniform sketch over an extensive range for
the sparsity constraint $r$. The performance of the uniform sketch is significantly worse, highlighting the
importance of a good sketch.


$r$     $f(\mathcal{H}_{\max/sp,r})/f(\mathcal{G}_{\max/sp,r})$   $\tau(\mathcal{G})/\tau(\mathcal{H})$   $f(\mathcal{U}_{\max/sp,r})/f(\mathcal{G}_{\max/sp,r})$   $\tau(\mathcal{G})/\tau(\mathcal{U})$
20      0.94/0.98      5.43      0.43/0.38      5.64
40      0.94/0.99      5.70      0.41/0.38      5.96
60      0.94/0.99      5.82      0.40/0.37      5.54
80      0.93/0.99      5.55      0.39/0.37      5.24
100     0.93/0.99      5.70      0.38/0.37      5.52

Table 2: [TechTC] Comparison of sparse principal components from the $(\ell_1, \ell_2)$-sketch and uniform
sketch.


As a final comparison, we look at the actual sparse top component with sparsity parameter $r = 10$.
The topic IDs in the TechTC data are 10567="US: Indiana: Evansville" and 11346="US: Florida". The
top-10 features (words) in the full PCA on the complete data are shown in Table 3.

In Table 4 we show which words appear in the top sparse principal component with sparsity $r = 10$
using various sparse PCA algorithms. We observe that the sparse PCA from the $(\ell_1, \ell_2)$-sketch with only
5% of the data sampled matches quite closely with the same sparse PCA algorithm using the complete
data ($\mathcal{G}_{\max/sp,r}$ matches $\mathcal{H}_{\max/sp,r}$).

ID    Top 10 in $\mathcal{G}_{\max,r}$    ID    Other words
1     evansville       11    service
2     florida          12    small
3     south            13    frame
4     miami            14    tours
5     indiana          15    faver
6     information      16    transaction
7     beach            17    needs
8     lauderdale       18    commercial
9     estate           19    bullet
10    spacer           20    inlets
                       21    producer

Table 3: [TechTC] Top ten words in top principal component of the complete data (the other words are
discovered by some of the sparse PCA algorithms).


$\mathcal{G}_{\max,r}$   $\mathcal{H}_{\max,r}$   $\mathcal{U}_{\max,r}$   $\mathcal{G}_{sp,r}$   $\mathcal{H}_{sp,r}$   $\mathcal{U}_{sp,r}$
1      1      6      1      1      6
2      2      14     2      2      14
3      3      15     3      3      15
4      4      16     4      4      16
5      5      17     5      5      17
6      7      7      6      7      7
7      6      18     7      8      18
8      8      19     8      6      19
9      11     20     9      12     20
10     12     21     13     11     21

Table 4: [TechTC] Relative ordering of the words (w.r.t. $\mathcal{G}_{\max,r}$) in the top sparse principal component
with sparsity parameter $r = 10$.


Stock Data. Algorithm 2 gives optimal mixing parameter $\alpha^* = 0.10$. We sample about 2% of the
non-zero elements from the centered data using our $(\ell_1, \ell_2)$-sampling, as well as uniform sampling. The
performance for small $r$ is shown in Table 5, including the running time $\tau$.

For this data, $f(\mathcal{G}_{\max,r})/f(\mathcal{G}_{sp,r}) \approx 0.96$ ($r = 10$). We observe a very significant performance
difference between the $(\ell_1, \ell_2)$-sketch and uniform sketch. A more extensive comparison of recovered variance
is given in Figure 2(c). We also observe a speed-up of a factor of about 4 for the $(\ell_1, \ell_2)$-sketch. Similar
to TechTC data this dataset is also “spikey”, and consequently biased sampling toward larger elements
significantly outperforms the uniform-sketch.

We now look at the actual sparse top component with sparsity parameter $r = 10$. The top-10 features
(stocks) in the full PCA on the complete data are shown in Table 6. In Table 7 we show which stocks
appear in the top sparse principal component using various sparse PCA algorithms. We observe that the
sparse PCA from the $(\ell_1, \ell_2)$-sketch with only 2% of the non-zero elements sampled matches quite closely
with the same sparse PCA algorithm using the complete data ($\mathcal{G}_{\max/sp,r}$ matches $\mathcal{H}_{\max/sp,r}$).

$^1$We computed $\alpha^*$ numerically in the range $[0.1, 1]$.

$r$     $f(\mathcal{H}_{\max/sp,r})/f(\mathcal{G}_{\max/sp,r})$   $\tau(\mathcal{G})/\tau(\mathcal{H})$   $f(\mathcal{U}_{\max/sp,r})/f(\mathcal{G}_{\max/sp,r})$   $\tau(\mathcal{G})/\tau(\mathcal{U})$
20      1.00/1.00      3.85      0.69/0.67      4.74
40      1.00/1.00      3.72      0.66/0.66      4.76
60      0.99/0.99      3.86      0.65/0.66      4.61
80      0.99/0.99      3.71      0.65/0.66      4.74
100     0.99/0.99      3.63      0.64/0.65      4.71

Table 5: [Stock data] Comparison of sparse principal components from the $(\ell_1, \ell_2)$-sketch and uniform
sketch.


ID    Top 10 in $\mathcal{G}_{\max,r}$    ID    Other stocks
1     T.2         11    HET.
2     AIG         12    ONE.1
3     C           13    MA
4     UIS         14    XOM
5     NRTLQ       15    PHA.1
6     S.1         16    CE
7     GOOG        17    WY
8     MTLQQ
9     ROK
10    EK

Table 6: [Stock data] Top ten stocks in top principal component of the complete data (the other stocks are
discovered by some of the sparse PCA algorithms).


$\mathcal{G}_{\max,r}$   $\mathcal{H}_{\max,r}$   $\mathcal{U}_{\max,r}$   $\mathcal{G}_{sp,r}$   $\mathcal{H}_{sp,r}$   $\mathcal{U}_{sp,r}$
1      1      2      1      1      2
2      2      11     2      2      11
3      3      12     3      3      12
4      4      13     4      4      13
5      5      14     5      5      14
6      6      3      6      7      3
7      7      15     7      6      15
8      9      9      8      8      9
9      8      16     9      9      16
10     11     17     10     11     17

Table 7: [Stock data] Relative ordering of the stocks (w.r.t. $\mathcal{G}_{\max,r}$) in the top sparse principal component
with sparsity parameter $r = 10$.


Gene Expression Data. Algorithm 2 gives optimal mixing parameter $\alpha^* = 0.92$. We sample about
9% of the elements from the centered data using our $(\ell_1, \ell_2)$-sampling, as well as uniform sampling. The
performance for small $r$ is shown in Table 8, including the running time $\tau$.

$r$     $f(\mathcal{H}_{\max/sp,r})/f(\mathcal{G}_{\max/sp,r})$   $\tau(\mathcal{G})/\tau(\mathcal{H})$   $f(\mathcal{U}_{\max/sp,r})/f(\mathcal{G}_{\max/sp,r})$   $\tau(\mathcal{G})/\tau(\mathcal{U})$
20      0.82/0.81      3.76      0.64/0.16      2.57
40      0.82/0.88      3.61      0.65/0.15      2.53
60      0.83/0.90      3.86      0.67/0.10      2.85
80      0.84/0.94      3.71      0.68/0.11      2.85
100     0.84/0.91      3.78      0.67/0.10      2.82

Table 8: [Gene data] Comparison of sparse principal components from the $(\ell_1, \ell_2)$-sketch and uniform
sketch.

For this data, $f(\mathcal{G}_{\max,r})/f(\mathcal{G}_{sp,r}) \approx 0.05$ ($r = 10$) which means a good sparse PCA algorithm is


ID | Top 10 in $\mathcal{G}_{\max,r}$ | ID | Other probes
---|-------------|----|-------------
1  | 210081_at   | 11 | 205866_at
2  | 214387_x_at | 12 | 209074_s_at
3  | 211735_x_at | 13 | 205311_at
4  | 209875_s_at | 14 | 216379_x_at
5  | 205982_x_at | 15 | 203571_s_at
6  | 215454_x_at | 16 | 205174_s_at
7  | 209613_s_at | 17 | 204846_at
8  | 210096_at   | 18 | 209116_x_at
9  | 204712_at   | 19 | 202834_at
10 | 203980_at   | 20 | 209425_at
   |             | 21 | 215356_at
   |             | 22 | 221805_at
   |             | 23 | 209942_x_at
   |             | 24 | 218450_at
   |             | 25 | 202508_s_at


Table 9: [Gene data] Top ten probes in top principal component of the complete data (the other probes are
discovered by some of the sparse PCA algorithms).


imperative. We observe a very significant performance difference between the $(\ell_1, \ell_2)$-sketch and the uniform
sketch. A more extensive comparison of recovered variance is given in Figure 2(d). We also observe a
speed-up of a factor of about 4 for the $(\ell_1, \ell_2)$-sketch. Similar to the TechTC data, this dataset is also "spiky",
and consequently sampling biased toward larger elements significantly outperforms the uniform sketch.

Also, we look at the actual sparse top component with sparsity parameter $r = 10$. The top-10 features
(probes) in the full PCA on the complete data are shown in Table 9.

In Table 10 we show which probes appear in the top sparse principal component with sparsity $r = 10$
using various sparse PCA algorithms. We observe that the sparse PCA from the $(\ell_1, \ell_2)$-sketch with only
9% of the elements sampled matches reasonably with the same sparse PCA algorithm using the complete
data ($\mathcal{G}_{\max/\mathrm{sp},r}$ matches $\mathcal{H}_{\max/\mathrm{sp},r}$).

Finally, we validate the genes corresponding to the top probes in the context of lung cancer. Table
11 lists the top twelve gene symbols in Table 9. Note that a gene can occur multiple times in a principal
component since genes can be associated with different probes.

Genes like SFTPC, AGER, WIF1, and FABP4 are down-regulated in lung cancer, while SPP1 is up-regulated
(see the functional gene grouping at: www.sabiosciences.com/rt_pcr_product/HTML/PAHS-13

$\mathcal{G}_{\max,r}$ | $\mathcal{H}_{\max,r}$ | $\mathcal{U}_{\max,r}$ | $\mathcal{G}_{\mathrm{sp},r}$ | $\mathcal{H}_{\mathrm{sp},r}$ | $\mathcal{U}_{\mathrm{sp},r}$
----|----|----|----|----|----
1   | 4  | 13 | 1  | 4  | 13
2   | 1  | 14 | 2  | 1  | 16
3   | 11 | 3  | 3  | 2  | 15
4   | 2  | 15 | 4  | 11 | 19
5   | 3  | 5  | 5  | 3  | 20
6   | 8  | 16 | 6  | 8  | 21
7   | 7  | 6  | 7  | 7  | 22
8   | 9  | 17 | 8  | 9  | 23
9   | 5  | 4  | 9  | 5  | 24
10  | 12 | 18 | 10 | 12 | 25


Table 10: [Gene data] Relative ordering of the probes (w.r.t. $\mathcal{G}_{\max,r}$) in the top sparse principal component
with sparsity parameter $r = 10$.


$\mathcal{G}_{\max,r}$ | $\nu$ | $\mathcal{H}_{\max,r}$ | $\nu$ | $\mathcal{H}_{\mathrm{sp},r}$ | $\nu$
-------|---|---------|---|---------|---
SFTPC  | 4 | SFTPC   | 3 | SFTPC   | 3
AGER   | 1 | SPP1    | 1 | SPP1    | 1
SPP1   | 1 | AGER    | 1 | AGER    | 1
ADH1B  | 1 | FCN3    | 1 | FCN3    | 1
CYP4B1 | 1 | CYP4B1  | 1 | CYP4B1  | 1
WIF1   | 1 | ADH1B   | 1 | ADH1B   | 1
FABP4  | 1 | WIF1    | 1 | WIF1    | 1
       |   | FAM107A | 1 | FAM107A | 1


Table 11: [Gene data] Gene symbols corresponding to top probes in Table 10. One gene can be associated
with multiple probes. Here $\nu$ is the frequency of occurrence of a gene in the top ten probes of their respective
principal component.


Co-expression analysis on the set of eight genes for $\mathcal{H}_{\max,r}$ and $\mathcal{H}_{\mathrm{sp},r}$ using the tool ToppFun (toppgene.cchmc.org)
shows that all eight genes appear in a list of selected probes characterizing non-small-cell lung carcinoma
(NSCLC) in Hou et al. [2010, Table S1]. Further, AGER and FAM107A appear in the top five highly
discriminative genes in Hou et al. [2010, Table S3]. Additionally, AGER, FCN3, SPP1, and ADH1B
appear among the 162 most differentiating genes across two subtypes of NSCLC and normal lung cancer
in Dracheva et al. [2007, Supplemental Table 1]. Such findings show that our method can identify, from
incomplete data, important genes for complex diseases like cancer. Also, notice that our sampling-based
method is able to identify additional important genes, such as FCN3 and FAM107A, in the top ten genes.


3.4 Performance of Other Sketches 

We briefly report on other options for sketching $A$. First, we consider suboptimal $\alpha$ (not $\alpha^*$ from
Algorithm llll in dUl to construct a suboptimal hybrid distribution. We use this distribution in proto-Algorithm
fflto construct a sparse sketch. Figure 3 reveals that a good sketch using the optimal $\alpha^*$ is important.

Second, another popular sketching method using element-wise sparsification is to sample elements not
biasing toward larger elements but rather toward elements whose leverage scores are high. See Chen et al.
[2014] for the detailed form of the leverage score sampling probabilities (which are known to work well


Figure 3: [Stock data] Performance of sketch using suboptimal $\alpha$ to illustrate the importance of the optimal
mixing parameter $\alpha^*$. (Plot omitted; x-axis: sparsity constraint $r$, in percent.)


Figure 4: [Low-rank data] Performance of sparse PCA of low-rank data for the optimal $(\ell_1, \ell_2)$-sketch and
leverage score sketch over an extensive range for the sparsity constraint $r$. The performance of the optimal
hybrid sketch is considerably better, highlighting the importance of a good sketch. Panels: (a) Digit (rank 3),
(b) TechTC (rank 2), (c) Stock (rank 3), (d) Gene (rank 2). (Plots omitted; x-axis in each panel: sparsity
constraint $r$, in percent.)


in other settings and can be plugged into our proto-Algorithm [Til). Let $A$ be an $m \times n$ matrix of rank $\rho$ with
SVD $A = U\Sigma V^T$. We define $\mu_i$ (row leverage scores), $\nu_j$ (column leverage scores),
and element-wise leverage scores $p_{\mathrm{lev}}$ as follows:

$$\mu_i = \|U_{(i)}\|_2^2, \qquad \nu_j = \|V_{(j)}\|_2^2, \qquad p_{\mathrm{lev}} = \frac{1}{2}\cdot\frac{\mu_i + \nu_j}{(m+n)\rho} + \frac{1}{2mn}, \qquad i \in [m],\ j \in [n].$$


At a high level, the leverage score of element $(i,j)$ is proportional to the squared norms of the $i$th
row of the left singular matrix and the $j$th row of the right singular matrix. Such leverage score sampling
is different from uniform sampling only for low-rank matrices or low-rank approximations to matrices,
so we used a low-rank approximation to the data matrix. We construct such a low-rank approximation by
projecting a dataset onto a low-dimensional subspace. We notice that the datasets projected onto the space
spanned by the top few principal components preserve the linear structure of the data. For example, the Digit data
show good separation of digits when projected onto the top three principal components. For the TechTC and Gene data, the
top two respective principal components are good enough to form a low-dimensional subspace where the datasets show
reasonable separation of two classes of samples. For the stock data we use the top three principal components because the
stable rank is close to 2.
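As a rough illustration of the leverage score probabilities discussed above, the following NumPy sketch computes them from a truncated SVD. It assumes the element-wise form $p_{\mathrm{lev}} = \frac{1}{2}\frac{\mu_i + \nu_j}{(m+n)\rho} + \frac{1}{2mn}$, with $\mu_i$ and $\nu_j$ the squared row norms of the top-$\rho$ left and right singular vectors; the function name and the synthetic rank-3 matrix are hypothetical.

```python
import numpy as np

def leverage_probabilities(A, rho):
    """Element-wise leverage score sampling probabilities for a rank-rho matrix (assumed form)."""
    m, n = A.shape
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    U_rho = U[:, :rho]                 # top-rho left singular vectors (m x rho)
    V_rho = Vt[:rho, :].T              # top-rho right singular vectors (n x rho)
    mu = (U_rho**2).sum(axis=1)        # row leverage scores; they sum to rho
    nu = (V_rho**2).sum(axis=1)        # column leverage scores; they sum to rho
    # first term sums to 1/2 over all (i, j); the uniform term adds the other 1/2
    return 0.5 * (mu[:, None] + nu[None, :]) / ((m + n) * rho) + 0.5 / (m * n)

rng = np.random.default_rng(1)
B = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 40))   # exactly rank 3
P_lev = leverage_probabilities(B, rho=3)
```

The uniform term guarantees every entry has probability at least $1/(2mn)$, so no element is ever excluded from sampling under this form.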

Let $\mathcal{L}_{\mathrm{sp},r}$ be the $r$-sparse components using Spasm for the leverage score sampled sketch $\tilde{A}$. Figure 4
shows that leverage score sampling is not as effective as the optimal hybrid $(\ell_1, \ell_2)$-sampling for sparse
PCA of low-rank data.

Conclusion. It is possible to use a sparse sketch (incomplete data) to recover sparse principal components
nearly as good as those obtained from the complete data. We mention that, while $\mathcal{G}_{\max}$,
which uses the largest weights in the unconstrained PCA, does not perform well with respect to the variance,
it does identify good features. A simple enhancement to $\mathcal{G}_{\max}$ is to recalibrate the sparse component
after identifying the features: this is an unconstrained PCA problem on just the columns of the data matrix
corresponding to the features. This method of recalibrating can be used to improve any sparse PCA
algorithm.
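The recalibration idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the thresholded component is taken to be the top principal component restricted to its $r$ largest-magnitude loadings and renormalized (an assumption about how the thresholding is done), and recalibration solves an unconstrained PCA on the selected columns only. All names and the synthetic data are hypothetical.

```python
import numpy as np

def threshold_then_recalibrate(A, r):
    """Return (thresholded, recalibrated) r-sparse unit vectors sharing one support."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    v = Vt[0]                                    # top principal component of centered A
    support = np.argsort(-np.abs(v))[:r]         # indices of the r largest loadings

    v_thr = np.zeros_like(v)
    v_thr[support] = v[support]
    v_thr /= np.linalg.norm(v_thr)               # renormalize to unit length (assumption)

    # Recalibrate: unconstrained PCA restricted to the selected columns
    _, _, Vt_sub = np.linalg.svd(A[:, support], full_matrices=False)
    v_rec = np.zeros_like(v)
    v_rec[support] = Vt_sub[0]
    return v_thr, v_rec

def variance(A, v):
    """Variance captured by direction v: ||A v||_2^2."""
    return float(np.linalg.norm(A @ v) ** 2)

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 40))
X -= X.mean(axis=0)
v_thr, v_rec = threshold_then_recalibrate(X, r=5)
```

Since `v_rec` is the best unit vector supported on the chosen features, `variance(X, v_rec)` can never be smaller than `variance(X, v_thr)`, which is why the same step can be appended to any method that outputs a feature set.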

Our algorithms are simple and efficient, and many interesting avenues for further research remain.
Can the sampling complexity for the top-$k$ sparse PCA be reduced from $O(k^2)$ to $O(k)$? We suspect
that this should be possible by getting a better bound on $\sum_{i=1}^{k}\sigma_i(A^TA - \tilde{A}^T\tilde{A})$; we used the crude
bound $k\|A^TA - \tilde{A}^T\tilde{A}\|_2$. We also presented a general surrogate optimization bound which may be of
interest in other applications. In particular, it is pointed out in Magdon-Ismail and Boutsidis [2015] that
though PCA optimizes variance, a more natural way to look at PCA is as the linear projection of the data
that minimizes the information loss. Magdon-Ismail and Boutsidis [2015] gives efficient algorithms to
find sparse linear dimension reduction that minimizes information loss; the information loss of sparse
PCA can be considerably higher than optimal. To minimize information loss, the objective to maximize
is $f(V) = \operatorname{trace}(A^TAV(AV)^{\dagger}A)$. It would be interesting to see whether one can recover sparse
low-information-loss linear projectors from incomplete data.
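To make the information-loss objective concrete, the sketch below evaluates $f(V) = \operatorname{trace}(A^TAV(AV)^{\dagger}A)$ and the corresponding loss $\|A - AV(AV)^{\dagger}A\|_F^2$ for an arbitrary sparse $V$. It assumes $\dagger$ denotes the Moore-Penrose pseudo-inverse; the random sparse $V$ and the function names are hypothetical and serve only to check that the two quantities add up to $\|A\|_F^2$.

```python
import numpy as np

def info_objective(A, V):
    """f(V) = trace(A^T (AV) (AV)^+ A), the quantity to maximize."""
    AV = A @ V
    return float(np.trace(A.T @ AV @ np.linalg.pinv(AV) @ A))

def info_loss(A, V):
    """Information loss ||A - (AV)(AV)^+ A||_F^2 of reconstructing A from the features AV."""
    AV = A @ V
    residual = A - AV @ np.linalg.pinv(AV) @ A
    return float(np.linalg.norm(residual, "fro") ** 2)

rng = np.random.default_rng(3)
A = rng.standard_normal((60, 20))
A -= A.mean(axis=0)

# Hypothetical sparse projector: 2 components, 4 nonzero loadings each
V = np.zeros((20, 2))
V[rng.choice(20, size=4, replace=False), 0] = rng.standard_normal(4)
V[rng.choice(20, size=4, replace=False), 1] = rng.standard_normal(4)

total = float(np.linalg.norm(A, "fro") ** 2)
f_val, loss = info_objective(A, V), info_loss(A, V)
```

Because $AV(AV)^{\dagger}$ is an orthogonal projector, `f_val + loss` equals `total` up to rounding, so maximizing $f(V)$ and minimizing the information loss are the same problem.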